SQL Server Deadlocks by Example


For each type of deadlock, we’ll review ‘typical’ deadlock graphs and discuss the signature that distinguishes each one, so that you can recognize it if you see it on your own systems. We’ll also consider the root causes of each type of deadlock, the code patterns that make them possible, how to prevent them from recurring, and the need to handle deadlocks, and all other SQL Server errors, gracefully, with error handling and retries.

The Difference between Severe Blocking and Deadlocking

In my experience, developers and DBAs often think that their SQL Server instance is experiencing deadlocks when, really, it is experiencing severe blocking.

Blocking occurs when session A requests a lock on a resource (typically a row, page or table), but SQL Server cannot grant that lock because session B already holds a non-compatible lock on that resource.

For example, let’s assume that session B is in the process of modifying a row in the Invoices table. The session’s associated process (thread) currently holds an Intent-Exclusive (IX) lock on both the table and the page that contains the row, and an X lock on the row. Simultaneously, session A needs to read a few pages of the same table. Its associated process acquires an Intent-Shared (IS) lock on the table (since the IS and IX lock modes are compatible) and then attempts to acquire an S lock on the pages it needs to read. However, session B’s process holds an IX lock on one of the pages that contains some of the rows session A needs. S locks and IX locks are incompatible, and so session B’s thread blocks session A’s until the former completes its work and releases the locks.

Lock modes and lock compatibility

I don’t have space in this piece for a fuller discussion of lock modes and compatibility. See Further Reading at the end of the article for some useful references.

This is a transient situation and is completely resolved by session B completing its work and releasing its locks. It is possible to have extensive blocking chains, where multiple sessions are blocked waiting for a session that is itself blocked waiting for another session that is blocked, and so on, repeating multiple times. However, at the head of the blocking chain will be a head ‘blocker’ that is not waiting for a lock. It may be waiting for some other resource, such as a latch, memory, or IO, but at least one session will not be waiting for a lock, and the blocking chain will clear as soon as the head blocker can continue processing.

A deadlock is different; it occurs when two or more sessions are waiting for each other, in such a way that none can complete. A deadlock can be viewed as a circular lock chain, where every single process in the blocking chain is waiting for one or more other processes in that same blocking chain.

Consider the simplest possible deadlock, with two sessions, two processes and two resources (later sections will demonstrate deadlocks that are more complex). A deadlock occurs in two steps. In the first, each of the two processes requests and acquires a lock. This will be within a transaction, explicit or not, and hence neither process will release these locks immediately.

In the second step, each of the two processes requests a lock on the resource on which the competing session holds a lock that is incompatible with the requested lock.

At this point, process 1 cannot continue until it receives the lock that it wants on page 1:4224. It cannot get that lock until process 2 finishes and releases its lock on that page. Process 2 cannot continue until it receives the lock that it wants on page 1:1370. It cannot get that lock until process 1 finishes and releases its lock on that page.

At this point, neither process can proceed; we have a deadlock. Without intervention, these two processes would sit forever waiting for each other. Fortunately, SQL Server automatically detects deadlocks and intervenes on our behalf.
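The two-step sequence just described is easy to reproduce by hand. The following sketch is illustrative only; the tables, columns and key values are hypothetical, not the ones from the graphs later in this article. Run the two batches in separate query windows, interleaving the statements in the numbered order:

```sql
-- Session 1 (hypothetical tables and values)
BEGIN TRANSACTION;
UPDATE dbo.Invoices
SET    InvoiceStatus = 'P'
WHERE  InvoiceID = 1;          -- step 1: takes an X lock in Invoices

-- ...switch to session 2 and run its first update (step 2)...

UPDATE dbo.Customers
SET    LatestOrderStatus = 'P'
WHERE  CustomerID = 1;         -- step 3: blocked by session 2's X lock
COMMIT TRANSACTION;

-- Session 2 (run step 2 between session 1's two updates)
BEGIN TRANSACTION;
UPDATE dbo.Customers
SET    LatestOrderStatus = 'X'
WHERE  CustomerID = 1;         -- step 2: takes an X lock in Customers

UPDATE dbo.Invoices
SET    InvoiceStatus = 'X'
WHERE  InvoiceID = 1;          -- step 4: blocked by session 1; deadlock
COMMIT TRANSACTION;
```

Within a few seconds of step 4, one of the two sessions will be killed and will receive the 1205 error discussed next.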

SQL Server’s Automatic Deadlock Detection and Resolution

SQL Server’s lock monitor has a deadlock detector that periodically checks the locks to see if there are any circular locking chains. If it finds any, it selects one of the sessions associated with a suspended thread, kills it, rolls back its transaction and releases its locks. This allows the other session to continue executing.

The killed session, known as the deadlock victim, receives error 1205:

Transaction (Process ID 75) was deadlocked on resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

The lock monitor picks the deadlock victim based, firstly, on the setting of DEADLOCK_PRIORITY for each session and, secondly (in the event of a tie) on the amount of work that it will take to roll back each of the open transactions.

The DEADLOCK_PRIORITY is a session-scoped setting that establishes the relative importance of the session completing its work, should it become embroiled in a deadlock. It can be set to HIGH, NORMAL or LOW, with NORMAL being the default. If we’d prefer SQL Server not to pick a certain session as a deadlock victim, we can set its DEADLOCK_PRIORITY to HIGH.

Integer values for DEADLOCK_PRIORITY

We can also set the DEADLOCK_PRIORITY to any integer value between -10 and +10; HIGH is equivalent to +5, NORMAL to 0 and LOW to -5. I recommend sticking with the named options. If someone is setting very fine-grained deadlock priorities, for example setting sessions to priorities of 2, 3 or 7, there is likely a larger problem at play (i.e. lots of deadlocks, and someone spending a lot of time prioritizing which processes are most important rather than fixing what is causing the deadlocks).
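For example, to make a restartable overnight batch job the preferred deadlock victim, or to protect a business-critical session from being chosen, we set the priority at the start of the session or batch:

```sql
-- In the batch job's session: volunteer as the deadlock victim
SET DEADLOCK_PRIORITY LOW;

-- In a critical session: avoid being chosen as the victim
SET DEADLOCK_PRIORITY HIGH;

-- Revert to the default
SET DEADLOCK_PRIORITY NORMAL;
```

The setting lasts for the duration of the session (or until changed), so it is typically issued at the top of the batch or stored procedure concerned.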

If two sessions deadlock, the lock monitor will select as the deadlock victim the one with the lower value for DEADLOCK_PRIORITY. If each has identical values for DEADLOCK_PRIORITY, then the lock monitor considers the resources required to roll back the ‘competing’ transactions; the one that requires the least work to roll back will be the deadlock victim. The lock monitor takes no account of how long a transaction has been running or how much work it has done; just the cost of rolling it back. If a deadlock occurs between session A, which has been running a SELECT for an hour, and session B that is running a single-row UPDATE, and both have the same DEADLOCK_PRIORITY, session A will be the deadlock victim as it made no data changes and hence costs nothing to roll back.

Preventing Deadlocks

It’s convenient that SQL Server will detect and resolve deadlocks automatically, but that doesn’t mean a DBA can just ignore them. After all, the deadlock victim may well be an important business operation and its failure to run will cause considerable disruption.

A DBA needs to know when a deadlock occurs in one of their SQL Server instances by alerting on 1205 errors, to capture the information, the deadlock graph, which will detail which processes were involved in the deadlock, and then set about trying to ensure that it does not occur again.
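One common way to set up such alerting is a SQL Server Agent alert on error 1205. The sketch below is one possible configuration, not a prescription; the alert name is made up, and note that, by default, error 1205 is not written to the error log, so on most versions you must first mark it as logged for the alert to fire:

```sql
-- Ensure error 1205 is written to the error log so an Agent alert can fire on it
EXEC master.dbo.sp_altermessage 1205, 'WITH_LOG', 'true';

-- Create the Agent alert (attach an operator notification separately)
EXEC msdb.dbo.sp_add_alert
    @name = N'Deadlock detected' ,          -- hypothetical alert name
    @message_id = 1205 ,
    @severity = 0 ,                         -- 0 because we alert on the message id
    @enabled = 1 ,
    @include_event_description_in = 1;      -- include the error text in notifications
```

You would then use sp_add_notification to route the alert to an operator, or have it run a job that captures the deadlock graph.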

How to Capture a SQL Server Deadlock Graph

A deadlock graph shows us the sessions and resources that were involved in a deadlock. Rather than repeat information ably covered elsewhere, I’m going to refer you to Jonathan Kehayias’ article, Handling Deadlocks in SQL Server, for background details of the various techniques by which to capture deadlock graphs, including various Trace Flags, the Profiler deadlock graph event, Service Broker event notifications, and Extended Events.

Prior to SQL Server 2008, if a deadlock occurred in SQL Server, we’d have to enable trace flags, or set up a server-side trace, and wait for the deadlock to recur. In SQL Server 2008 and later, we can retrieve deadlock graphs retrospectively from the Extended Events system_health session. I used this technique for the article as, frankly, it is the most straightforward way to get the deadlock graphs, using Jonathan’s query. Listing 1 shows how to return the deadlock graph from the ring_buffer target of the system_health event session (the code download also contains the equivalent code for the event_file target). This version of the query is specific to SQL Server 2012; see the previously referenced article for the 2008 version.

Listing 1: Returning the deadlock graph from the system_health event session
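The listing itself isn’t reproduced here, but for reference, the SQL Server 2012 version of Jonathan’s query for the ring_buffer target is along these lines (consult his article for the canonical version):

```sql
-- Pull xml_deadlock_report events out of the system_health ring_buffer target
SELECT XEvent.query('(event/data/value/deadlock)[1]') AS DeadlockGraph
FROM   ( SELECT XEvent.query('.') AS XEvent
         FROM   ( SELECT CAST(target_data AS XML) AS TargetData
                  FROM   sys.dm_xe_session_targets AS st
                         INNER JOIN sys.dm_xe_sessions AS s
                             ON s.address = st.event_session_address
                  WHERE  s.name = 'system_health'
                         AND st.target_name = 'ring_buffer' ) AS Data
                CROSS APPLY
                TargetData.nodes('RingBufferTarget/event[@name="xml_deadlock_report"]')
                    AS XEventData (XEvent)
       ) AS src;
```

Each row returned is one deadlock graph, as XML, which you can click through in SSMS or save as a .xdl file.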

The deadlock graph obtained from the system_health extended events session is extremely similar both to the error log output for traceflag 1222 and to the XML captured by SQL Profiler when it is tracing for the deadlock graph event.

How to Read a Deadlock Graph

In the next section, we’ll start our tour of common types of deadlocks and their resolution. In order to follow along, you’ll need to know your way around a deadlock graph, so let’s take a brief tour.

I generated an example deadlock by executing two stored procedures, UpdateCustomerLatestOrderStatus and AddOrder (which we’ll use again, and discuss in more detail, later) though any simple deadlock will do at this early stage, since we’re only interested at this stage in the overall structure of the resulting deadlock graph.

Generate a deadlock and then retrieve the deadlock graph, for example by running Listing 1 to pull it from the system_health event session. Figure 1 shows my deadlock graph, in XML format.


Figure 1: A sample deadlock graph showing the processes and resources sections

The Extended Events live data viewer, built into SSMS in SQL Server 2012, as well as tools such as Profiler, can present a GUI representation of the deadlock graph. However, I prefer looking at the ‘raw’ XML version. While most of the information in the deadlock graph is accessible from the GUI representation, it’s not as easy to find and is not all in one place. I find myself clicking on the various sessions and resources multiple times to get the whole picture.

An XML deadlock graph has two main sections, labelled (1) and (2) in Figure 1.

  1. Processes section – details all the processes involved in the deadlock, what they were running, what isolation level they were in and more
  2. Resources section – lists all the resources that were involved in the deadlock, which locks each process acquired and which locks they requested.

I like to start by looking at the resources section, to see which resources the processes were fighting over and what types of locks they were requesting. Once I have a picture of what locks were involved, then I go back and look at the details of the processes to see what they were doing.

There will be at least two entries here, but there may be more. Each entry starts with a description of the resource and then lists the processes that either held a lock on that resource or requested a lock on it. Locks here will mainly be key, RID, page or table locks, with more exotic range locks possible if the query was running under SERIALIZABLE isolation. It’s also possible to have non-lock resources here, especially if a query was running in parallel. Start by noting the granularity and mode of the locks and the objects involved. Are there table-level locks held or requested? How many different tables are involved? How many indexes are involved?

Next, we move on to the processes section, which contains an entry for every thread involved in the deadlock. Notice that I say thread, not session, because if a session is running a parallel query, it can appear multiple times in this section. This section of the graph provides a wealth of information, including login names, host names, isolation level, times, session settings and more.

The most useful information, generally, is the isolation level under which each query was running and the details of which statement completed the deadlock. I say ‘completed’ the deadlock, because the statement listed in the deadlock graph may be just the one that the session happened to be running at the point that the deadlock detector identified that the session was part of a deadlock.

Common types of SQL Server Deadlocks

Here we begin our dissection of the most common types of SQL Server deadlock. We’ll review deadlock graphs for each, discuss what distinguishes each type, and then consider what causes them and how to avoid them in future.

In each case, the deadlock graphs I present are representative of real graphs, produced on real SQL Server instances operating under concurrent workloads. Some require some tricks and contrivances to reproduce on an isolated test instance; others are simply very hard to reproduce ‘on demand’. The code download for this article contains code samples that will allow you to reproduce the reader-writer, writer-writer, key lookup and serializable deadlock types.

However, please bear in mind, firstly, that your output won’t match exactly what I present in this article, though the basic signature will be similar. Secondly, that the code download examples are contrived specifically to produce the deadlock. They in no way represent code you’d hope or expect to see on a production system.

Reader-Writer Deadlocks

A reader-writer deadlock is a deadlock between a statement that is reading and a statement that is performing some form of data modification.

When you look at the resources involved, you’ll see that the signature of this form of deadlock is locks that are all either a shared (S) lock granted and an exclusive (X) lock requested or an X lock granted and an S lock requested.

In its default isolation level (READ COMMITTED), SQL Server holds S locks only until the statement completes. As such, reader-writer deadlocks are most prevalent in code that has a data modification followed by a select query within a transaction, or in code that requests a higher isolation level, under which SQL Server holds S locks until the end of the transaction, either intentionally or because the client libraries default to a higher isolation level.

There is one very easy fix for this form of deadlock, and that is to use a row-version based isolation level, either READ COMMITTED SNAPSHOT or SNAPSHOT. In the row-version based isolation levels, readers do not take locks and instead use row versions for isolation. No shared locks means no reader-writer deadlocks.
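READ COMMITTED SNAPSHOT is enabled at the database level. A sketch, using a hypothetical database name:

```sql
-- Switch READ COMMITTED to its row-versioning form for the whole database.
-- WITH ROLLBACK IMMEDIATE kills open transactions so the change can complete.
ALTER DATABASE SalesDB              -- hypothetical database name
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;
```

Bear in mind that row versioning is not free: the version store lives in tempdb, so test the change under a realistic workload before enabling it in production.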

Transaction Isolation levels

Again, there is not room in this article for a more detailed description of the different transaction isolation levels, and how each prevents read phenomena, such as dirty reads, non-repeatable reads and so on. See Further Reading.

The sample deadlock graph in Figure 1 is, in fact, one generated by a reader-writer deadlock and we’re going to drill into the details of that graph here. As discussed earlier, we’ll start with the resources section.

Listing 2: The resources section for a reader-writer deadlock

We have two processes here; for the moment I’m going to call them 48 and c8 (the last two characters of their owner_id). There are two resources: page 649 in database 23 and page 192 in database 23. With no object name given for the page lock (key locks and object locks give the name), we’re going to have to do a little bit of work to identify the table.

The first step, however, is to identify the database, using the db_name() function. Once we have the database name, we can use the associatedObjectID (which in this case is not an objectID) to get the table name.

Listing 3: Obtaining object names from partition IDs.
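The listing isn’t reproduced here, but the technique is to treat the associatedObjectId as a partition (hobt) ID and map it back through sys.partitions, something like this (the ID value shown is an illustrative placeholder, not the one from Figure 1):

```sql
-- Identify the database from the id in the resource entry
SELECT DB_NAME(23) AS DatabaseName;

-- Then, in that database, map the associatedObjectId (a hobt ID,
-- not an object_id) to the table and index it belongs to
SELECT OBJECT_NAME(p.object_id) AS TableName ,
       i.name AS IndexName
FROM   sys.partitions AS p
       INNER JOIN sys.indexes AS i
           ON  i.object_id = p.object_id
           AND i.index_id = p.index_id
WHERE  p.hobt_id = 72057594039173120;   -- example value from a deadlock graph
```

An index name of NULL, or index_id 0 or 1, indicates the heap or clustered index, i.e. the base table itself.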

We can identify the second resource the same way, and it turns out that the tables involved were Customers and Orders, specifically the clustered index of each.

To work out the sequence of events that led to the deadlock, we look at the owner-list and waiter-list for each resource. The process listed in the owner-list is the one that held the lock; the process or processes in the waiter-list are the ones that had requested the lock and were waiting for it.

Using those lists, we can see that Process 48 had an exclusive lock on the page in Orders and Process c8 had an exclusive lock on the page in Customers. That was the first step. Then Process 48 requested a read lock on the locked page in Customers and Process c8 requested a read lock on the locked page in Orders.

At this point, even without looking at the processes section, there’s enough information to consider potential fixes. If the order of either, or both, sets of statements were reversed, and the queries were running in READ COMMITTED, then this deadlock wouldn’t occur, because under that isolation level shared locks are released no later than the end of the query that requested them, and so the shared locks would be released before the update started. Alternatively, if we could move either SELECT outside the transaction, then this deadlock wouldn’t occur. These won’t necessarily be the actual solutions implemented, but they’re worth keeping in mind.

For now, however, let’s move on to the process section. In Listing 4, I’ve removed some bits from the XML to keep the size down and applied some manual formatting to make the listing easier to read on the page.

Listing 4: The processes section for a reader-writer deadlock

There’s a lot of information in there. To start, the process ID matches the process IDs listed in the resources section. We can see the client application, host name and login name of both sessions. If multiple applications use the server, this can help narrow down the culprit. Occasionally, you may find that the deadlocks originate from ad-hoc queries from Management Studio. In this case, fixing the deadlock may be as simple as asking the user to stop running that query or to run it elsewhere or at another time.

We can see that the transactionname is user_transaction, indicating that the code formed part of an unnamed, explicit transaction.

The waitresource shows the database, objects and pages on which the processes are deadlocked, and reflects what we saw in the resources section. If the isolationlevel indicates that an application or procedure has requested a higher isolation level, it’s worth investigating whether or not this is a true requirement, or just a default, but unnecessary, setting.

The input buffer (inputbuf) lists which statements each session sent to SQL Server. In both cases, in this example, it’s a call to a stored procedure, so our next step is to investigate these procedures. Process c8 calls the procedure UpdateCustomerLatestOrderStatus and process 48 calls the procedure AddOrder.

Listing 5: The UpdateCustomerLatestOrderStatus and AddOrder stored procedures

For the purposes of this article, please try to ignore the complete lack of error handling in these procedures. If they were real production code, there would be a lot more verification, error handling and checks. If they also appear oddly written, that’s intentional as I wrote them in a way that ensured they would cause a deadlock.
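The procedure bodies aren’t reproduced here, but based on the behavior described in the rest of this section, they would have looked something like the following. This is a plausible reconstruction only; the table shapes, column names and status values are my assumptions, and the real listing may differ in detail:

```sql
-- Reconstruction: update Customers, then select from Orders, in a transaction
CREATE PROCEDURE dbo.UpdateCustomerLatestOrderStatus
    @CustomerID INT ,
    @LatestOrderStatus CHAR(1)
AS
BEGIN TRANSACTION;
    UPDATE dbo.Customers                       -- X lock on a page in Customers
    SET    LatestOrderStatus = @LatestOrderStatus
    WHERE  CustomerID = @CustomerID;

    SELECT OrderID , OrderDate , OrderStatus   -- S lock requested on Orders
    FROM   dbo.Orders
    WHERE  CustomerID = @CustomerID;
COMMIT TRANSACTION;
GO

-- Reconstruction: insert into Orders, select Customers (with a hardcoded
-- status), then update Customers, all in one transaction
CREATE PROCEDURE dbo.AddOrder
    @CustomerID INT
AS
BEGIN TRANSACTION;
    INSERT INTO dbo.Orders ( CustomerID , OrderDate , OrderStatus )
    VALUES ( @CustomerID , GETDATE() , 'A' );  -- X lock on a page in Orders

    SELECT CustomerID ,
           CustomerName ,
           'A' AS LatestOrderStatus            -- hardcoded anticipated value
    FROM   dbo.Customers                       -- S lock requested on Customers
    WHERE  CustomerID = @CustomerID;

    UPDATE dbo.Customers                       -- never reached in the deadlock
    SET    LatestOrderStatus = 'A'
    WHERE  CustomerID = @CustomerID;
COMMIT TRANSACTION;
GO
```

The essential shape is what matters: each procedure modifies one table and reads the other inside the same transaction, and the two procedures touch the tables in opposite orders.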

To understand the deadlock, we need to match the code that ran to the locks listed in the resources section of the deadlock graph.

We’ll start with process c8, which the resources section told us had taken an exclusive (X) lock on a page in Customers and then requested a shared (S) lock on a page in Orders. Since UpdateCustomerLatestOrderStatus contains only two queries, we can deduce easily that the X lock this process holds results from the update of Customers, and the S lock it requested results from the select on Orders.

Process 48 first took an X lock on the page in Orders, which would be the INSERT into Orders. It then requested an S lock to perform a select against Customers, but could not proceed as process c8 had an incompatible lock on the page it needed. At this point, we had a deadlock, and process 48 never even reached the subsequent update of Customers.

Now we know what caused the deadlock, it’s relatively easy, in this case, to prevent it. Let’s start with the UpdateCustomerLatestOrderStatus stored procedure. The explicit transaction in this procedure is only necessary if there are multiple data modification statements that need to form an atomic unit, or if the result of the update could affect the result of the subsequent select. In fact, we have a single data modification followed by a select on a different table, which will return the same results regardless of whether the update commits or rolls back. We can also safely say that there’s no trigger, because there is no indication of one in the deadlock graph (it would appear in the executionStack sub-section).

In short, we can remove the explicit transaction, as shown in Listing 6.

Listing 6: Remove the explicit transaction from UpdateCustomerLatestOrderStatus

This procedure can no longer cause the deadlock. However, to be sure, let’s fix the AddOrder procedure too. This one’s a little harder.

The select is against the same rows as the update right after it. If we look at what it’s doing, the select is returning the customer row as it will be after the update completes (it’s specifying the LatestOrderStatus as A, which is the value to which the update sets it).

There’s a potential bug here though. If that update never commits, the select may return ‘dirty’ data. This may be the developer’s intent but probably isn’t, so I’ll move the select outside the transaction and remove the hardcoded value for LatestOrderStatus, just letting the select return the value that’s in the table.

Listing 7: Modifying the AddOrder stored procedure to prevent deadlocks
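The listing isn’t reproduced here, but a sketch of the fix just described, using the same hypothetical table shapes as before, would look like this:

```sql
-- Sketch: keep only the data modifications in the transaction, and
-- move the select after the commit so it returns committed data only
ALTER PROCEDURE dbo.AddOrder
    @CustomerID INT
AS
BEGIN TRANSACTION;
    INSERT INTO dbo.Orders ( CustomerID , OrderDate , OrderStatus )
    VALUES ( @CustomerID , GETDATE() , 'A' );

    UPDATE dbo.Customers
    SET    LatestOrderStatus = 'A'
    WHERE  CustomerID = @CustomerID;
COMMIT TRANSACTION;

SELECT CustomerID ,
       CustomerName ,
       LatestOrderStatus       -- no longer hardcoded; read from the table
FROM   dbo.Customers
WHERE  CustomerID = @CustomerID;
GO
```

Note that the transaction now also accesses Orders before Customers, which matters for the access-order rule discussed next.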

If the exact current behavior is required and correct, I could instead move the select so that it runs as-is, after the transaction commits.

Now these procedures will no longer deadlock, but there is still one potential problem. These two procedures still access the same objects, but in different orders. UpdateCustomerLatestOrderStatus touches Customers first and then Orders, and AddOrder accesses them in the reverse order. If a future change wraps the first procedure’s contents in a transaction again, they may well start deadlocking again.

One general rule for preventing deadlocks is to always access objects in the same order, so let’s make one more fix to UpdateCustomerLatestOrderStatus.

Listing 8: Modifying UpdateCustomerLatestOrderStatus so that it accesses objects in the same order as AddOrder
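As a sketch (again with my hypothetical table shapes), the reordered procedure reads Orders first and only then touches Customers, matching AddOrder:

```sql
-- Sketch: Orders first, then Customers, matching AddOrder's access order
ALTER PROCEDURE dbo.UpdateCustomerLatestOrderStatus
    @CustomerID INT ,
    @LatestOrderStatus CHAR(1)
AS
SELECT OrderID , OrderDate , OrderStatus
FROM   dbo.Orders
WHERE  CustomerID = @CustomerID;

UPDATE dbo.Customers
SET    LatestOrderStatus = @LatestOrderStatus
WHERE  CustomerID = @CustomerID;
GO
```

With no transaction and a consistent object order, the two procedures can no longer form a lock cycle between these tables.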

That should ensure that these two procedures never deadlock again.

Writer-Writer Deadlocks

In a writer-writer deadlock both the granted lock and requested lock on a resource are update or exclusive locks. In other words, both operations attempt data modifications.

One important thing to note when investigating writer-writer deadlocks is that SQL Server holds exclusive locks until the transaction commits, unlike shared locks, which in the default READ COMMITTED isolation level SQL Server holds no longer than the end of the statement (and can in fact release as soon as it has read the row, before the statement completes).

Note also that the Snapshot isolation levels won’t help us with writer-writer deadlocks, as these levels affect only SQL Server’s behavior with regard to S locks. SQL Server will still take exclusive locks for data modifications.

This aside, we adopt more or less the same approach to fixing a writer-writer deadlock as we did for fixing a reader-writer deadlock, so I’m not going to go into immense detail. Again, we’ll start with the resources section of the deadlock graph.

Listing 9: The resources section of a writer-writer deadlock

All locks involved are exclusive (X) locks. This means we can’t consider one of the snapshot isolation levels, nor will we be able to fix this by moving statements outside of a transaction.

Once again, we see that two processes (c8 and 08) are engaged in the deadlock. Once again, we use the db_name() function and the query from Listing 3 to identify the objects involved; they are the Orders and Customers tables.

According to the resources section, the order of events was as follows:

  1. process 08 takes an exclusive lock on a page in Orders
  2. process c8 takes an exclusive lock on a page in Customers
  3. process 08 requests an exclusive lock on a page in Customers
  4. process c8 requests an exclusive lock on a page in Orders.

The processes section is nearly identical to the one for the reader-writer deadlock, with the primary difference being the content of the input buffers.

Listing 10: Processes section of a deadlock graph for a writer-writer deadlock

It’s our old friend AddOrder, fixed to prevent reader-writer deadlock, but now engaged in a writer-writer deadlock with a different procedure, DispatchOrder, shown in Listing 11.

Listing 11: The DispatchOrder stored procedure

The sequence of the deadlock is as follows

  • Process 08, running AddOrder, takes an exclusive lock on the page in Orders.
  • Process c8, running DispatchOrder, takes an exclusive lock on the page in Customers.
  • Process 08 requests an exclusive lock on the page in Customers.
  • Process c8 requests an exclusive lock on the page in Orders, resulting in a deadlock.
  • The SELECT in DispatchOrder was never reached.

Once again, the root cause is accessing the same objects in different orders. At this point, if this were a real system, I’d be making a change to the coding standards document mandating that all transactions access the Orders table before the Customers table.

To fix this deadlock I’m going to make two changes. First, pull the select outside of the transaction so that it returns the committed data, rather than returning data containing modifications that might roll back. Second, I’m going to ensure that DispatchOrder, like the other two procedures, accesses Orders first and then Customers.

Listing 12: Modifying DispatchOrders to avoid deadlocks
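A sketch of the fixed procedure, with the same caveats as before about my invented parameter, column and status values:

```sql
-- Sketch: Orders before Customers inside the transaction;
-- the select runs after the commit
ALTER PROCEDURE dbo.DispatchOrder
    @OrderID INT
AS
BEGIN TRANSACTION;
    UPDATE dbo.Orders
    SET    OrderStatus = 'D'           -- 'D' for dispatched; hypothetical value
    WHERE  OrderID = @OrderID;

    UPDATE dbo.Customers
    SET    LatestOrderStatus = 'D'
    WHERE  CustomerID = ( SELECT CustomerID
                          FROM   dbo.Orders
                          WHERE  OrderID = @OrderID );
COMMIT TRANSACTION;

SELECT OrderID , OrderStatus           -- committed data only
FROM   dbo.Orders
WHERE  OrderID = @OrderID;
GO
```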

Since a transaction is, by definition, an atomic operation, which of the two updates runs first or second has no impact on the overall results, at least in cases like this where the two updates are independent. As such, the change I’ve made won’t change the behavior of the procedure.

So far, I’ve made no mention of tuning the procedures, simply because the sample procedures that I wrote to simulate deadlock-prone code are already about as optimal as possible. However, when dealing with deadlocks generally, tuning the procedures is a critical part of the solution, even if there’s an obvious cause, as in these examples. Any time a query takes more locks than necessary, or holds locks longer than necessary, there’s an increased chance it will deadlock with another process, and the busier the server, the higher the chance that this will happen.

Of course, query tuning is a huge topic, and not specific to deadlocks, so I’m not going to cover it here. See the Further Reading section at the end of this article for some useful references.

Key Lookup Deadlocks

Technically, key-lookup deadlocks are a special case of reader-writer deadlocks, but they are worth addressing separately as, unlike the earlier examples, this type of deadlock does not require that the processes are running multiple statements within a transaction. With a key lookup deadlock, it’s perfectly possible for a session running a single select statement to deadlock with a session running a single update statement. The reason this is possible relates to both SQL Server’s index architecture and the order in which the query processor runs operations.

Let’s look at an example deadlock graph for a key lookup deadlock. As is our custom, we’ll review the resources section first.

Listing 13: Resources section of a Key Lookup deadlock graph

One key point to note about the resources section is that there’s only one table involved, and two indexes on that table:

  • Process 2f8025498 holds a shared lock on the non-clustered index (idx_InvoiceNumber) and has requested a shared lock on the clustered index
  • Process 2efb09498 holds an exclusive lock on the clustered index and has requested an exclusive lock on the non-clustered index

Listing 14 shows the processes section of the deadlock graph.

Listing 14: Processes section of a Key Lookup deadlock graph

The process list reveals only two statements, a query and an update. We can see from the executionStack that, this time, we’re dealing with ad-hoc SQL. We see each statement listed twice in the executionStack (once as the entire batch submitted, once as the statement that completed the deadlock) and again in the inputbuf (as the command sent to SQL Server). One statement is a single SELECT against the Invoices table, filtering on InvoiceNumber; the other is a single UPDATE of the same table, filtering on InvoiceID.

There are no other queries involved here, and no explicit transactions. These are two individual statements deadlocking.
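Based on the analysis that follows, the two statements would have looked something like this; the exact predicates and values are illustrative, not taken from the graph:

```sql
-- Statement 1: a select via the non-covering index on InvoiceNumber
SELECT *
FROM   dbo.Invoices
WHERE  InvoiceNumber = 'INV-0042';   -- hypothetical value

-- Statement 2: an update that modifies a key column of that index
UPDATE dbo.Invoices
SET    InvoiceStatus = 5             -- hypothetical value
WHERE  InvoiceID = 42;               -- same row as the select, via the clustered key
```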

Process 2f8025498 is reading the Invoices table via the non-clustered index, which happens to be non-covering. As such, it also needs to retrieve data from the clustered index. In such cases, SQL Server’s query processor first takes a lock on the non-clustered index, reading the index rows, then takes locks on the associated rows in the clustered index and performs the lookup.

When a data modification occurs, SQL Server first locks and modifies the rows in the clustered index, then locates the non-clustered indexes that it needs to change, takes locks on those rows and modifies them. As a result, we have a race condition that can lead to a deadlock.

In this case, the SELECT, using a non-covering index on InvoiceNumber, took a shared lock on the index key for the InvoiceNumber passed. The UPDATE took an exclusive key lock on the clustered index for the InvoiceID it was passed. The InvoiceNumber and InvoiceID happened to belong to the same row. The select then requested a shared lock on the row in the clustered index to do its key lookup, but was blocked by the exclusive lock that the UPDATE held. The update then requested an exclusive lock on the non-clustered index to change the InvoiceStatus, which is a key column in that index, but the shared lock held by the SELECT meant that SQL Server could not grant the request. At this point, the deadlock detector identifies the deadlock and terminates the select.

These deadlocks are annoying to fix, because technically the developer has done nothing wrong, nothing to cause the deadlock. To fix this, we need to make the non-clustered index a covering index. In the example, the query is a SELECT *, so this will be difficult and probably inefficient, and so fixing the deadlock will involve figuring out which columns are actually needed, replacing the SELECT * with a SELECT just of those columns and then making the index covering.
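A sketch of that fix, assuming the analysis showed that only a couple of columns were actually needed (the column list here is my invention):

```sql
-- Step 1: replace SELECT * with just the columns the application needs
SELECT InvoiceID , InvoiceNumber , InvoiceStatus
FROM   dbo.Invoices
WHERE  InvoiceNumber = 'INV-0042';

-- Step 2: rebuild the non-clustered index so it covers that query.
-- InvoiceID, as the clustering key, is carried in the index automatically.
CREATE NONCLUSTERED INDEX idx_InvoiceNumber
    ON dbo.Invoices ( InvoiceNumber )
    INCLUDE ( InvoiceStatus )
    WITH ( DROP_EXISTING = ON );
```

With the index covering, the select never touches the clustered index, so it has no second resource to deadlock on.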

If it were not possible to cover the query, then some retry logic on the select would work as well. The lock manager will not choose the update as the deadlock victim, by default, because it requires more work to undo than the SELECT.
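A minimal T-SQL retry sketch, wrapping a hypothetical query; in practice the retry logic often lives in the application layer instead:

```sql
-- Retry the statement a few times if it is chosen as a deadlock victim
DECLARE @Retries INT = 3;

WHILE @Retries > 0
BEGIN
    BEGIN TRY
        SELECT InvoiceID , InvoiceNumber , InvoiceStatus
        FROM   dbo.Invoices
        WHERE  InvoiceNumber = 'INV-0042';   -- hypothetical query

        SET @Retries = 0;                    -- success: exit the loop
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() = 1205 AND @Retries > 1
        BEGIN
            SET @Retries = @Retries - 1;     -- deadlock victim: try again
            WAITFOR DELAY '00:00:00.100';    -- brief pause before the retry
        END
        ELSE
            THROW;                           -- other error, or retries exhausted
    END CATCH;
END;
```

The short delay before retrying gives the competing transaction time to finish, so the retry doesn’t immediately deadlock again.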

Since these are a special case of reader-writer deadlocks, using one of the row-versioning based isolation levels for the SELECT will also resolve this form of deadlock.

Parallelism Deadlocks

Any deadlock can involve queries running in parallel. When that happens, the deadlock graph can become massively complex, with multiple additional entries in the resources section, showing parallelism-related wait types, and multiple additional entries in the processes section. The key to interpreting a deadlock graph polluted with parallelism is to ignore all the parallelism-related sections.

Listing 15 shows the resources section of a deadlock graph for a deadlock that involved queries running in parallel.

Listing 15: Resources section for a parallelism-related deadlock graph

The exchangeEvent resources are parallelism exchange operators. These extra entries can make the deadlock graph very confusing and a lot harder to read. However, notice that all the locks granted or requested are either X or U, meaning that this is a writer-writer deadlock.

There are two types of parallelism-related deadlocks. The first is a normal deadlock where one or more of the queries are running in parallel. The second is an intra-query parallelism deadlock, which occurs when a single, parallelized query deadlocks with itself.

To identify whether or not this is an intra-query parallelism deadlock, we examine the processes section and see how many different values there are for the spid attribute. Listing 16 shows the massively cut-down processes section of the deadlock graph.

Listing 16: Processes section for a parallelism-related deadlock graph

The process list reveals two processes, spid 52 and spid 53. In effect, this is a normal writer-writer deadlock, but with one of the processes, spid 52, running in parallel, as indicated by the duplicated process entries. Just ignore the parallel resources and debug this in the same way as the previous writer-writer deadlock.

Let’s compare that with the other kind of parallelism-related deadlock. The processes section of an intra-query parallelism deadlock graph (see Listing 17) will reveal only a single spid.

Listing 17: Processes section for an intra-query parallelism deadlock graph

The fact that the intra-query variety of parallelism deadlock can occur at all is a ‘bug’ in the query processor. Such deadlocks shouldn’t be able to occur, but there are cases where they will.

If such a deadlock occurs regularly, the solution is first to tune the query, the idea being that if the query is more efficient then SQL Server won’t need to parallelize it as much or at all and perhaps that will remove the conditions that lead to the deadlock.

If that doesn’t work, then adding a MAXDOP hint to the query to force it to run serially will remove any chance of an intra-query parallelism deadlock, though possibly at the cost of a longer execution time.
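As a quick illustration (the query and table here are my hypothetical stand-ins, not from the deadlocking system), a MAXDOP hint is applied per statement like this:

```sql
-- Hypothetical example: force this statement to run serially,
-- removing any possibility of an intra-query parallelism deadlock.
UPDATE dbo.Orders
SET    Status = 'Processed'
WHERE  OrderDate < '2020-01-01'
OPTION (MAXDOP 1);
```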

Finally, it’s also a good idea to ensure that you’ve patched SQL Server to the latest service pack.

Range Scans and SERIALIZABLE deadlocks

The SERIALIZABLE isolation level is prone to deadlocks because it takes restrictive locks and holds all locks for the duration of the transaction. Any time you see in a deadlock graph a process with isolationlevel="serializable(4)", your first debugging step should be to confirm whether or not that transaction really needs to run in SERIALIZABLE mode, and if not, whether switching to a lower isolation level resolves the deadlock.

If use of SERIALIZABLE is unavoidable, then we’re back to an exercise in tuning the statements and structures that are involved in the deadlock. In this regard, one very common query pattern is especially prone to deadlocks resulting from the SERIALIZABLE range locks.

When a transaction runs in SERIALIZABLE isolation level, SQL Server has to prevent phantom rows from appearing in a resultset. It acquires Range locks in order to prevent another transaction from inserting rows into ranges that the SERIALIZABLE transaction is reading. It acquires these locks on a range of the relevant index, or locks the entire table if there isn’t a relevant index.

These range locks appear as RangeS-S, RangeS-U, RangeI-N or RangeX-X locks (again, see Further Reading for more information on these lock types). For our deadlock example, we are interested specifically in the RangeS-S and RangeI-N locks.

SQL Server acquires the RangeS-S lock when a select query requests a specific row, and is running under SERIALIZABLE isolation.

Under SERIALIZABLE isolation level, if that query returns zero rows, then any time it is run again within the transaction it must also return zero rows. Locking only the rows where CustomerName = @p1 doesn’t suffice, so SQL Server locks the range in the index that would contain the value @p1: the entire range between the index key value before @p1 and the index key value after @p1. Any insert into that range, whether for CustomerName = @p1 or any other value in the range, is blocked.

Likewise, an INSERT running under SERIALIZABLE isolation level has to acquire, briefly, RangeI-N locks until it finds the correct location to insert the new row. If a SELECT statement already holds a RangeS-S lock on the same range, it will block the acquisition of the RangeI-N locks, since RangeS-S and RangeI-N locks are incompatible. Now imagine multiple transactions trying to read and then insert into the same range; it’s a recipe for deadlocks.

Having set the scene, let’s look at an example of the problematic pattern.

Listing 18: The ‘Upsert’ pattern

Initially it looks fine (other than a lack of error handling). It’s a traditional ‘upsert’ operation where we insert the row if it doesn’t exist, or update it if it does. There’s a unique index on CustomerName and the rowlock hint ensures that SQL will only lock a single row. Right?
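The shape of that pattern is roughly as follows. This is a sketch only; the table, column and variable names are my assumptions, not the original listing:

```sql
-- Hypothetical upsert; a unique index on CustomerName is assumed.
DECLARE @CustomerName nvarchar(100) = N'Matthew Green';

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

IF EXISTS (SELECT 1 FROM dbo.Customers WITH (ROWLOCK)
           WHERE CustomerName = @CustomerName)
BEGIN
    UPDATE dbo.Customers
    SET    LastOrderDate = GETDATE()
    WHERE  CustomerName = @CustomerName;
END
ELSE
BEGIN
    INSERT INTO dbo.Customers (CustomerName, LastOrderDate)
    VALUES (@CustomerName, GETDATE());
END;

COMMIT TRANSACTION;
```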

Well, no, not under SERIALIZABLE isolation level. Let’s say that the CustomerName that we’re checking is ‘Matthew Green’, the value immediately below that in the index is ‘Leonard Kotz’ and the value immediately above it is ‘Nicolas van Schalkwyk’. Under SERIALIZABLE isolation level, SQL will lock the entire range from ‘Leonard Kotz’ right through to ‘Nicolas van Schalkwyk’ and subsequently block any attempt to insert any value within that range.

If we have two different sessions trying to insert different customers in that range then both sessions will request and receive RangeS-S locks (it’s a shared lock so multiple sessions can hold it). Both sessions then try to insert the customer rows, with different customer names. To do so, each needs a RangeI-N lock. This lock is not compatible with the RangeS-S that both sessions hold over the same range and so we get a deadlock.

Listing 19 shows a typical deadlock graph.

Listing 19: Deadlock graph for Range scan and SERIALIZABLE deadlock

The key indicator that we have a deadlock resulting from SERIALIZABLE range locks is the presence of this row:

The process is attempting to convert a lock from RangeS-S to RangeI-N to do the insert and it can’t.

So how do we fix this problem? There are a couple of ways. One way is to ditch the entire IF EXISTS... construct and instead use a MERGE statement. A MERGE statement is a single, atomic operation; it doesn’t need SERIALIZABLE isolation level to work correctly and isn’t prone to these deadlocks. On the downside, there are some annoying bugs with MERGE.
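A sketch of the MERGE alternative might look like this (the table and column names are my hypothetical stand-ins):

```sql
-- Hypothetical MERGE-based upsert for the same assumed Customers table.
DECLARE @CustomerName nvarchar(100) = N'Matthew Green';

MERGE dbo.Customers AS tgt
USING (SELECT @CustomerName AS CustomerName) AS src
    ON tgt.CustomerName = src.CustomerName
WHEN MATCHED THEN
    UPDATE SET LastOrderDate = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (CustomerName, LastOrderDate)
    VALUES (src.CustomerName, GETDATE());
```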

Another option is to ditch the IF EXISTS and use two statements, an update and an insert.

Listing 20: Replace IF EXISTS with separate UPDATE and INSERT statements

This may look strange, especially the SELECT with a WHERE but no FROM, but this is valid. Now that we have two separate statements without a transaction, the worst that can happen is that the INSERT is blocked; it can no longer cause deadlocks.
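A sketch of this two-statement version, with hypothetical table and column names, might look like the following. The outer SELECT in the INSERT is the one with a WHERE but no FROM:

```sql
-- Hypothetical sketch: update first, then insert only if the row
-- still doesn't exist. No explicit transaction is used.
DECLARE @CustomerName nvarchar(100) = N'Matthew Green';

UPDATE dbo.Customers
SET    LastOrderDate = GETDATE()
WHERE  CustomerName = @CustomerName;

INSERT INTO dbo.Customers (CustomerName, LastOrderDate)
SELECT @CustomerName, GETDATE()
WHERE  NOT EXISTS (SELECT 1 FROM dbo.Customers
                   WHERE CustomerName = @CustomerName);
```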

Finally, if we want to keep the IF EXISTS structure, we can add another locking hint.

Listing 21: Using the UPDLOCK hint

Update locks are not compatible with other update locks and so this serializes the entire process. The second process to run is blocked right at the beginning by the locks that the first one has taken. Until the first one completes the entire process, the second one cannot start to run.
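In outline, that version of the pattern might look like this (again, the table, column and variable names are my hypothetical stand-ins):

```sql
-- Hypothetical upsert with UPDLOCK on the existence check, so that
-- concurrent upserts over the same range serialize instead of deadlocking.
DECLARE @CustomerName nvarchar(100) = N'Matthew Green';

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

IF EXISTS (SELECT 1 FROM dbo.Customers WITH (UPDLOCK)
           WHERE CustomerName = @CustomerName)
BEGIN
    UPDATE dbo.Customers
    SET    LastOrderDate = GETDATE()
    WHERE  CustomerName = @CustomerName;
END
ELSE
BEGIN
    INSERT INTO dbo.Customers (CustomerName, LastOrderDate)
    VALUES (@CustomerName, GETDATE());
END;

COMMIT TRANSACTION;
```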

Partition escalation deadlocks

This type of deadlock is only possible on a partitioned table where the table’s lock_escalation option has been set to AUTO, which on a partitioned table allows lock escalation to go to the partition level rather than the table level.

The default lock_escalation option, TABLE, does not allow this deadlock to occur and the fact that this deadlock can occur when lock escalation is set to AUTO is the reason why it’s not the default on partitioned tables!

Before we dig into this one, we need a very quick summary of lock escalation (see Further Reading for more). SQL Server will start locking at the lowest granularity it thinks is reasonable for the operation in hand. Typically, it uses row or page locks. However, if the number of locks on a single object exceeds a threshold (currently 5000), or the memory used for locks exceeds a percentage of SQL’s memory (currently 40%), then it triggers lock escalation. SQL Server will try to replace the row or page locks on the object with a table-level or partition-level lock, depending on the setting of the table’s LOCK_ESCALATION option, and on whether or not the index is partitioned.

Let’s look at a deadlock graph (yes, this is a contrived example).

Listing 22: Deadlock graph for a partition escalation deadlock

What happened here is that each of the updates affected only one partition. At the point that the escalation threshold was hit, SQL Server escalated the initial row locks to partition-level locks (because partition-level locking is enabled). At the point that the updates finished, each session had an exclusive lock on a different, single partition of the table. The select then requires a lock on the other partition of the table, which causes the deadlock.

Essentially:

  • Process 1 requests row-level X locks on Partition 1 of a table
  • Process 2 requests row-level X locks on Partition 2 of a table
  • Process 1’s locks are escalated to partition-level
  • Process 2’s locks are escalated to partition-level

At this point, each process holds an exclusive lock on a different partition of the table.

  • Process 1 requests a shared lock on a row in Partition 2 of the table. It is blocked.
  • Process 2 requests a shared lock on a row in Partition 1 of the table. It is blocked.

Now we have a deadlock, and SQL Server’s deadlock detector identifies it and kills one of the sessions.

If the lock_escalation option were set to the default of TABLE, this deadlock would not occur. With lock escalation set to TABLE, if both updates had started taking locks and then one or both triggered a lock escalation, the escalation to table locks would fail. Locks cannot be escalated to table if there are any incompatible locks already on any of the rows or pages in the table. With the escalation having failed, the X locks that each update took would stay as row locks. The select, which then needed a lock on a row not affected by the update, could get the lock it needs, and both queries would complete without deadlocking.

With deadlocks like this, where one of the locks is at the HoBT level, the first step should be changing the index’s lock settings to escalate to table. If the deadlock is caused by partition-level lock escalation then this change will remove the chance of deadlock occurring. If the deadlocks still occur even with escalation set to table, then the deadlock should be debugged as for a reader-writer or writer-writer deadlock, as detailed earlier.
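Checking and changing that setting is straightforward. In this sketch, the table name is an assumption:

```sql
-- Check the current escalation setting for a hypothetical table.
SELECT name, lock_escalation_desc
FROM   sys.tables
WHERE  name = 'Orders';

-- Revert partition-level escalation back to the TABLE default.
ALTER TABLE dbo.Orders SET (LOCK_ESCALATION = TABLE);
```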

Handling deadlocks

The best way to deal with deadlocks is via error handling and retries. All application code that deals with database queries should have error handling. To be honest, all application code should have proper error handling, but I might as well also wish for peace on earth and an end to poverty.

If application code assumes that any database query will succeed, that code is brittle and a potential problem because there are a number of issues that can cause database calls to fail.

Any form of high availability failover drops connections. If the application assumes that a connection is open and runs a query it will get an error. If the application doesn’t handle the error properly, the application can crash or the user can get a message about something they probably don’t understand and almost certainly don’t care about. Likewise, deadlocks, or update conflicts, depending on isolation level, can cause a query to fail. If the application doesn’t handle the error properly, the consequences are the same.

Application code should have error handling and retry logic on any data-access code. If a query fails, the error handling code should look at what error was raised and decide, based on that error, whether to retry the query, or log the failure and send the user a meaningful error (see Further Reading).
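The same idea can also be sketched server-side in T-SQL. This is a minimal, hypothetical outline (the procedure name and retry count are assumptions), retrying only on the deadlock error, 1205:

```sql
-- Hypothetical retry wrapper: retry up to 3 times if chosen as a
-- deadlock victim (error 1205); re-raise any other error.
DECLARE @Retries int = 3;

WHILE @Retries > 0
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        EXEC dbo.DoSomeWork;   -- hypothetical procedure doing the data access
        COMMIT TRANSACTION;
        BREAK;                 -- success: stop retrying
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;

        IF ERROR_NUMBER() = 1205 AND @Retries > 1
            SET @Retries = @Retries - 1;  -- deadlock victim: loop and retry
        ELSE
            THROW;  -- other error, or retries exhausted: re-raise to caller
    END CATCH;
END;
```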

Summary

I hope that this article will serve as a good reference in recognizing and resolving the different types of deadlock that can afflict your SQL Servers, and the business operations that run on them. I hope it will also help you to identify the sort of coding mistakes and (anti)patterns that make them more likely to occur, especially on busy systems with a high level of concurrent access.

If you can deploy on your SQL Server instances one of the row-versioning isolation levels (SNAPSHOT or READ_COMMITTED_SNAPSHOT), then by definition you rule out the possibility of any form of Reader-Writer deadlock, since readers won’t acquire Shared locks in these modes.
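Both options are database-level settings. As a sketch (the database name is an assumption), they are enabled like this:

```sql
-- Make READ COMMITTED use row versioning for the whole database.
-- Note: this ALTER needs the database free of other active connections
-- to take effect.
ALTER DATABASE SalesDB SET READ_COMMITTED_SNAPSHOT ON;

-- Or allow sessions to opt in per-transaction with SNAPSHOT isolation.
ALTER DATABASE SalesDB SET ALLOW_SNAPSHOT_ISOLATION ON;
```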

Beyond that, avoiding deadlocks is largely an exercise in query and index tuning. I haven’t had room in this piece to digress too far into this huge topic, but the references I provide in Further Reading, below, should help direct your efforts.

Further Reading

Load comments

About the author

Gail Shaw, famous for her forum contributions under the pen name 'gilamonster', is from Johannesburg Gauteng and works as a database consultant in performance tuning and database optimisation. Before moving to consulting she worked at a large South African investment bank and was responsible for the performance of the major systems there.